Filter

Spark's filter() function filters rows from a DataFrame or Dataset based on a given condition or SQL expression. Alternatively, you can use the where() operator instead of filter() if you are coming from a SQL background; both functions behave exactly the same.

val empDF = spark.createDataFrame(Seq(
      (7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      (7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      (7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
      (7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
      (7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
      (7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      (7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      (7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      (7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      (7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
      (7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
    )).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")

empDF.show


filter with Column Condition
Use a Column with a condition to filter rows from a DataFrame. With this approach you can express complex conditions by referring to column names using col(name), $"colname", or dfObject("colname"); this style is mostly used while working with DataFrames. Use "===" for equality comparison.

There are various alternate syntaxes that give you the same result and same performance.
  • df.where("column_Name = value")
  • df.where($"column_Name" === value)
  • df.where('column_Name === value)
empDF.filter("deptno = 30").show

empDF.where("deptno = 30").show

empDF.filter($"deptno" === 30).show

empDF.where($"deptno" === 30).show

 
empDF.filter('deptno === 30).show

empDF.where('deptno === 30).show

empDF.filter($"job" === "CLERK").show

You can also use where() in place of filter():
empDF.where($"job" === "SALESMAN").show
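The other two column-reference styles mentioned above, col("name") and dfObject("colname"), work the same way; a short sketch against the empDF defined earlier:

```scala
import org.apache.spark.sql.functions.col

// col("name") form — requires importing org.apache.spark.sql.functions
empDF.filter(col("deptno") === 20).show()

// dfObject("colname") form — resolves the column against empDF itself
empDF.filter(empDF("deptno") === 20).show()
```

All three forms produce the same plan, so choose whichever reads best; the dfObject form is mainly useful to disambiguate columns after a join.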
 
DataFrame filter() with SQL Expression
If you are coming from a SQL background, you can use that knowledge to filter DataFrame rows in Spark with SQL expressions.
empDF.filter("deptno == 30").show


empDF.filter("job == 'CLERK'").show
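SQL expression strings are not limited to a single equality; the usual SQL operators (AND, OR, IN, LIKE) also work inside them. A sketch against the same empDF:

```scala
// Combine conditions with standard SQL operators inside the string
empDF.filter("deptno = 30 AND sal >= 1500").show()

// IN and LIKE are supported as in regular SQL
empDF.filter("job IN ('CLERK', 'ANALYST')").show()
empDF.filter("ename LIKE 'S%'").show()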


Filter with multiple conditions
To filter() rows of a DataFrame based on multiple conditions, you can use either a Column with a condition or a SQL expression. Below are simple examples; you can extend them with AND (&&), OR (||), and NOT (!) conditional expressions as needed.
empDF.filter(($"job" === "SALESMAN") && ($"deptno" === 30)).show(false)



empDF.filter(($"job" === "SALESMAN") || ($"deptno" === 30)).show(false)


empDF.filter($"sal" > 1500).show

 
empDF.filter($"sal" < 1500).show
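The NOT (!) operator mentioned above, together with isin() for matching a column against a list of values, can be sketched as:

```scala
// Negate a Column condition with the unary ! operator
empDF.filter(!($"job" === "MANAGER")).show()

// Equivalent: the =!= (not-equal) Column operator
empDF.filter($"job" =!= "MANAGER").show()

// isin() keeps rows whose column value is in the given list
empDF.filter($"deptno".isin(10, 20)).show()
```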
